Existing learning-based multi-view stereo (MVS) methods rely on the depth range to build the 3D cost volume and may fail when the range is too large or unreliable. To address this problem, we propose DispMVS, a disparity-based MVS method built on the epipolar disparity flow (E-flow), which infers depth from the pixel movement between two views. The core of DispMVS is to construct a 2D cost volume on the image plane along the epipolar line of each pair (the reference image paired with each of several source images) for pixel matching, and to fuse the multiple depths triangulated from each pair by multi-view geometry to ensure multi-view consistency. To be robust, DispMVS starts from a randomly initialized depth map and iteratively refines it with a coarse-to-fine strategy. Experiments on the DTUMVS and Tanks and Temples datasets show that DispMVS is not sensitive to the depth range and achieves state-of-the-art results with lower GPU memory.
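As a minimal sketch of the two-view geometry this abstract relies on (not the DispMVS implementation), the snippet below recovers the depth of a reference pixel from a single matched source pixel. It assumes normalized image coordinates (intrinsics already removed) and a hypothetical relative pose `R`, `t` that maps reference-frame points to the source frame; the function name `depth_from_match` is illustrative only.

```python
def depth_from_match(ref_xy, src_xy, R, t):
    """Least-squares depth of a reference pixel given its match in a source view.

    ref_xy, src_xy: normalized image coordinates (x/z, y/z).
    R, t: assumed relative pose with X_src = R @ X_ref + t
          (R is a 3x3 nested list, t a length-3 list).
    """
    # Ray through the reference pixel; the 3D point is depth * ray.
    ray = (ref_xy[0], ref_xy[1], 1.0)
    # Rotate the ray into the source frame: X_src = depth * a + t.
    a = [sum(R[i][j] * ray[j] for j in range(3)) for i in range(3)]
    u, v = src_xy
    # Projection constraints u = X_src.x / X_src.z and v = X_src.y / X_src.z
    # rearrange to two linear equations of the form m_k * depth = n_k.
    m = (u * a[2] - a[0], v * a[2] - a[1])
    n = (t[0] - u * t[2], t[1] - v * t[2])
    denom = m[0] ** 2 + m[1] ** 2
    if denom < 1e-12:
        raise ValueError("no parallax between the two views; depth is unobservable")
    # Closed-form least-squares solution of the two equations.
    return (m[0] * n[0] + m[1] * n[1]) / denom


# Usage: a point at depth 5 seen by two cameras that differ by a sideways shift.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(depth_from_match((0.1, -0.2), (0.0, -0.2), identity, [-0.5, 0.0, 0.0]))
```

The usage line prints a value of approximately 5.0. Note that the pixel displacement between the views determines the depth directly, without any predefined depth range.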
translated by Google Translate
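Panoramic images can present complete information about the surrounding environment at once and have many advantages in virtual tourism, games, robotics, and other applications. However, progress in panoramic depth estimation has not fully solved the distortion and discontinuity problems caused by commonly used projection methods. This paper proposes SphereDepth, a novel panoramic depth estimation method that predicts depth directly on a spherical mesh without projection preprocessing. The core idea is to establish the relationship between the panoramic image and the spherical mesh and then use a deep neural network to extract features in the spherical domain to predict depth. To address the efficiency challenges posed by high-resolution panoramic data, we introduce two hyperparameters that balance inference speed and accuracy. Validated on three public panoramic datasets, SphereDepth achieves results comparable to state-of-the-art panoramic depth estimation methods. Benefiting from the spherical-domain setting, SphereDepth can generate high-quality point clouds and significantly alleviates the distortion and discontinuity problems.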
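One way to picture the relationship between a panoramic image and a spherical mesh is the standard equirectangular mapping from a direction on the sphere to pixel coordinates. The sketch below is an illustration under assumed conventions (+y up, longitude 0 along +z, image origin at the top-left), not the mapping used by SphereDepth; `sphere_to_equirect` is a hypothetical helper name.

```python
import math


def sphere_to_equirect(x, y, z, width, height):
    """Map a 3D direction (e.g. a mesh vertex) to equirectangular pixel coordinates.

    Assumed conventions: +y is up, longitude 0 points along +z,
    u grows with longitude, v grows downward from the north pole.
    """
    norm = math.sqrt(x * x + y * y + z * z)
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    lon = math.atan2(x, z)                          # range (-pi, pi]
    lat = math.asin(max(-1.0, min(1.0, y / norm)))  # range [-pi/2, pi/2]
    u = (lon + math.pi) / (2.0 * math.pi) * width
    v = (math.pi / 2.0 - lat) / math.pi * height
    return u, v


# Usage: the forward direction lands at the centre of a 1024x512 panorama.
print(sphere_to_equirect(0.0, 0.0, 1.0, 1024, 512))
```

Sampling image features at the returned `(u, v)` for every mesh vertex would give per-vertex inputs on the sphere, which avoids resampling the panorama into distorted planar patches.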
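Recent advanced studies have spent considerable human effort on optimizing network architectures for stereo matching, yet they can hardly achieve both high accuracy and fast inference speed. To ease the workload of network design, neural architecture search (NAS) has been applied with great success to various sparse prediction tasks, such as image classification and object detection. However, existing NAS studies on dense prediction tasks, especially stereo matching, still cannot be deployed efficiently on devices with different computing capabilities. To this end, we propose to train an elastic and accurate network for stereo matching (EASNet) that supports various 3D architectural settings on devices with different computing capabilities. Given the deployment latency constraint of the target device, we can quickly extract a sub-network from the full EASNet without additional training while still maintaining the accuracy of the sub-network. Extensive experiments show that EASNet outperforms state-of-the-art human-designed and NAS-based architectures on the Scene Flow and MPI Sintel datasets in terms of model accuracy and inference speed. In particular, deployed on an inference GPU, EASNet achieves a new SOTA EPE on the Scene Flow dataset in 100 ms, which is 4.5$\times$ faster than LEAStereo with a better-quality model.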
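In federated learning (FL), model performance typically suffers from client drift induced by data heterogeneity, and mainstream works focus on correcting client drift. We propose a different approach, named virtual homogeneity learning (VHL), to directly "rectify" the data heterogeneity. In particular, VHL conducts FL with a virtual homogeneous dataset crafted to satisfy two conditions: it contains no private information and it is separable. The virtual dataset can be generated from pure noise shared across clients, with the aim of calibrating the features from heterogeneous clients. Theoretically, we prove that VHL can achieve provable generalization performance on the natural distribution. Empirically, we demonstrate that VHL endows FL with drastically improved convergence speed and generalization performance. VHL is the first attempt to use a virtual dataset to address data heterogeneity, offering a new and effective means for FL.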
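Generative adversarial networks (GANs) have proven very successful in image generation tasks, but GAN training suffers from instability. Many works have improved the stability of GAN training by manually modifying the GAN architecture, which requires human expertise and extensive trial and error. Thus, neural architecture search (NAS), which aims to automate model design, has been applied to search GANs on the task of unconditional image generation. Early NAS-GAN works search only the generator to reduce the difficulty. Some recent works have attempted to search both the generator (G) and the discriminator (D) to improve GAN performance, but they still suffer from the instability of GAN training during the search. To alleviate the instability issue, we propose an efficient two-stage evolutionary algorithm (EA) based NAS framework to discover GANs, dubbed EAGAN. Specifically, we decouple the search of G and D into two stages and propose a weight-resetting strategy to improve the stability of GAN training. In addition, we perform evolution operations to produce Pareto-front architectures based on multiple objectives, resulting in a superior combination of G and D. By leveraging the weight-sharing strategy and low-fidelity evaluation, EAGAN can significantly shorten the search time. EAGAN achieves highly competitive results on CIFAR-10 (IS=8.81$\pm$0.10, FID=9.91) and surpasses previous NAS-searched GANs on the STL-10 dataset (IS=10.44$\pm$0.087, FID=22.18).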
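The notion of a Pareto front over multiple objectives can be illustrated with a short sketch. It assumes two objectives, Inception Score (higher is better) and FID (lower is better), and uses made-up candidate scores; the abstract does not state which objectives EAGAN actually optimizes during search, so this is an illustration of non-dominated filtering rather than the paper's procedure.

```python
def dominates(a, b):
    """True if score a = (is_score, fid) is at least as good as b on both
    objectives and strictly better on at least one (higher IS, lower FID)."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(candidates):
    """Return the names of candidates that no other candidate dominates.

    candidates: dict mapping a name to an (is_score, fid) tuple.
    """
    return [
        name
        for name, score in candidates.items()
        if not any(dominates(other, score) for other in candidates.values())
    ]


# Usage with hypothetical scores (not results from the paper).
scores = {
    "A": (8.5, 12.0),
    "B": (8.8, 10.0),
    "C": (8.2, 9.5),
    "D": (8.0, 11.0),
}
print(pareto_front(scores))  # ['B', 'C']
```

Here B dominates A and C dominates D, while B and C trade off against each other, so both remain on the front as candidates for the next generation.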
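Federated learning (FL) is a distributed learning paradigm that can learn a global or personalized model from decentralized datasets on edge devices. However, in the computer vision domain, model performance in FL lags far behind centralized training because a unified FL framework has not been explored across diverse tasks. FL has rarely been demonstrated effectively on advanced computer vision tasks such as object detection and image segmentation. To bridge the gap and facilitate the development of FL for computer vision tasks, in this work we propose a federated learning library and benchmarking framework, named FedCV, that evaluates FL on the three most representative computer vision tasks: image classification, image segmentation, and object detection. We provide non-I.I.D. benchmarking datasets, models, and various reference FL algorithms. Our benchmark study suggests that multiple challenges deserve future exploration: centralized training tricks may not apply directly to FL; non-I.I.D. datasets degrade model accuracy to some degree across tasks; and improving the system efficiency of federated training is challenging given the huge number of parameters and the per-client memory cost. We believe that such a library and benchmark, along with comparable evaluation settings, is necessary to make meaningful progress in FL on computer vision tasks. FedCV is publicly available: https://github.com/fedml-ai/fedcv.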
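The COVID-19 pandemic has threatened global health. Many studies have applied deep convolutional neural networks (CNNs) to recognize COVID-19 from chest 3D computed tomography (CT). Recent works show that no model generalizes well across CT datasets from different countries, and that designing models for specific datasets requires expertise. Neural architecture search (NAS), which aims to search for models automatically, has therefore become an attractive solution. To reduce the search cost on large 3D CT datasets, most NAS-based works use the weight-sharing (WS) strategy so that all models share weights within a supernet. However, WS inevitably incurs search instability, leading to inaccurate model estimation. In this work, we propose an efficient evolutionary multi-objective architecture search (EMARS) framework. We propose a new objective, namely potential, which helps exploit promising models and thereby indirectly reduces the number of models involved in weight training, alleviating search instability. We demonstrate that under the objectives of accuracy and potential, EMARS can balance exploitation and exploration, i.e., reduce search time and find better models. Our searched models are small and perform better than prior works on three public COVID-19 3D CT datasets.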
A further understanding of cause and effect within observational data is critical across many domains, such as economics, health care, public policy, web mining, online advertising, and marketing campaigns. Although significant advances have been made to overcome the challenges in causal effect estimation with observational data, such as missing counterfactual outcomes and selection bias between treatment and control groups, the existing methods mainly focus on source-specific and stationary observational data. Such learning strategies assume that all observational data are already available during the training phase and come from only one source. This practical concern of accessibility is ubiquitous in various academic and industrial applications. In short, in the era of big data, we face new challenges in causal inference with observational data: the extensibility to incrementally available observational data, the adaptability to an extra domain adaptation problem beyond the imbalance between treatment and control groups, and the accessibility of an enormous amount of data. In this position paper, we formally define the problem of continual treatment effect estimation, describe its research challenges, and then present possible solutions to this problem. Moreover, we discuss future research directions on this topic.
Text-based speech editing allows users to edit speech by intuitively cutting, copying, and pasting text, which speeds up the editing process. In previous work, CampNet (context-aware mask prediction network) was proposed to realize text-based speech editing, significantly improving the quality of edited speech. This paper targets a new task: adding emotional effects to the edited speech during text-based speech editing to make the generated speech more expressive. To this end, we propose Emo-CampNet (emotion CampNet), which offers a choice of emotional attributes for the generated speech in text-based speech editing and has the one-shot ability to edit unseen speakers' speech. Firstly, we propose an end-to-end emotion-selectable text-based speech editing model. The key idea of the model is to control the emotion of the generated speech by introducing additional emotion attributes on top of the context-aware mask prediction network. Secondly, to prevent the emotion of the generated speech from being interfered with by the emotional components of the original speech, a neutral content generator is proposed to remove the emotion from the original speech; it is optimized within a generative adversarial framework. Thirdly, two data augmentation methods are proposed to enrich the emotional and pronunciation information in the training set, which enables the model to edit unseen speakers' speech. The experimental results show that 1) Emo-CampNet can effectively control the emotion of the generated speech in the process of text-based speech editing and can edit unseen speakers' speech, and 2) detailed ablation experiments further prove the effectiveness of emotional selectivity and the data augmentation methods. The demo page is available at https://hairuo55.github.io/Emo-CampNet/
Photorealistic style transfer aims to transfer the artistic style of an image onto an input image or video while keeping photorealism. In this paper, we argue that the summary-statistics matching scheme in existing algorithms is what leads to unrealistic stylization. To avoid employing the popular Gram loss, we propose a self-supervised style transfer framework, which contains a style removal part and a style restoration part. The style removal network removes the original image styles, and the style restoration network recovers image styles in a supervised manner. Meanwhile, to address the problems in current feature transformation methods, we propose decoupled instance normalization, which decomposes feature transformation into style whitening and restylization. It works well in ColoristaNet and can transfer image styles efficiently while keeping photorealism. To ensure temporal coherency, we also incorporate optical flow methods and ConvLSTM to embed contextual information. Experiments demonstrate that ColoristaNet achieves better stylization effects than state-of-the-art algorithms.